Abstract: This article introduces a robust hybrid method for solving supervised learning tasks, which combines the Echo State Network (ESN) model and the Particle Swarm Optimization (PSO) algorithm. An ESN is a Recurrent Neural Network whose hidden-hidden weights are kept fixed during the learning process. The recurrent part of the network stores the input information in the internal states of the network. A second, memoryless structure is used as the supervised learning tool. The procedure used to initialize the recurrent structure of the ESN model can affect the model performance. On the other hand, the PSO has been shown to be a successful technique for finding optimal points in complex spaces. Here, we present an approach that uses the PSO to find some of the initial hidden-hidden weights of the ESN model. We present empirical results that compare the canonical ESN model with this hybrid method on a wide range of benchmark problems.

Key words: Recurrent Neural Networks; Particle Swarm Optimization; Echo State Network; Reservoir Computing; Time-series problems

I. Introduction 

A Recurrent Neural Network (RNN) is a powerful tool for time-series modeling. It has been used for solving supervised temporal learning tasks as well as for information processing in biological neural systems. The recurrent topology of the network ensures that a non-linear transformation of the input information can be stored in internal states. In spite of that, recurrent networks present some limitations when solving real-world applications. They can incur high computational costs during the training process when a first-order learning algorithm is used (for instance, a gradient-descent-type algorithm). During the 1990s much effort was devoted to identifying the learning problems of RNNs.

At the beginning of the 2000s, two models were introduced for designing and training RNNs. They were developed independently and named Echo State Network (ESN) and Liquid State Machines. Since 2007 this trend has become popularly known as Reservoir Computing (RC). The RC approach is an attempt to overcome the training limitations of RNNs, in particular the long convergence time. An RC model is an RNN with the particularity that the weights involved in cyclic connections are kept fixed during the training process. The recurrent structure of the network is called the reservoir and it is composed of the hidden-hidden weights. The other structure of the model, called the readout, refers to the weight connections that are free of recurrences; in graph terms, the readout is composed of the circuit-free weights. Only the readout weights are adjusted in the learning process.

Even though RC methods have been successfully used for solving temporal tasks, tuning their parameters can be difficult. The initialization of the reservoir parameters often requires human expertise and several empirical trials. Over the last years, several approaches to reservoir design have been studied. An analysis of intrinsic plasticity for the ESN model has been presented. A specific kind of RC method uses topographic maps for initializing its weights. Besides, an evolutionary algorithm was used for designing the reservoir [9]. Additionally, other metaheuristic techniques have been applied to optimize the reservoir global parameters, topology and reservoir weights [10]-[12].

Particle Swarm Optimization (PSO) is an efficient and widely used metaheuristic for finding optimal regions in complex spaces. The PSO was applied for defining the spectral radius, the kind of transfer function, the reservoir size and the presence of feedback connections [13]. In this paper, we modify the way the PSO is used to construct the reservoir with respect to the approach presented in [13]. We adjust a subset of the reservoir weights; the rest of the weights are kept fixed during training, as usual in RC models. Our hypothesis is that it is enough to tune a few weights of the reservoir using the PSO algorithm in order to improve the ESN performance in terms of computational time and accuracy. This strategy obtains good experimental results without requiring operations with high computational cost (for instance, it avoids computing the spectral radius of the reservoir matrix).

This article is structured as follows. Section II presents the background of the two main models used in this work: ESN and PSO. Section III contains the contribution of this work. Next, we present our experimental results, and then we give our conclusions and discuss future work.

II. Background 

In this section, we specify the context in which ESN models are applied. An ESN model is mainly used for solving supervised learning tasks in which the data set presents temporal dependencies, although it can also be used for non-temporal supervised learning problems [1]. Besides, we describe both the ESN tool and the PSO technique.

A. Problem Specification 

Given a training set composed of pairs of discrete-time vectors $(\mathbf{a}(t), \mathbf{b}(t))$, with $\mathbf{a}(t) \in \mathbb{R}^{N_a}$ and $\mathbf{b}(t) \in \mathbb{R}^{N_b}$ for all $t$ in an arbitrary interval of time, the goal of a supervised learning task is to find a parametric mapping $\phi(\cdot)$ such that a distance function is minimized. This distance function measures the deviation of the predictions of $\phi(\cdot)$ from the target values $\mathbf{b}$. Examples of distance functions are the square error and the Kullback-Leibler distance. In this article the mapping is given by the ESN model and we evaluate it using the square error distance.
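
For concreteness, one common way to write the square error criterion over a training sequence of length $T$ is sketched below. The averaging over $T$ and the use of the Euclidean norm are assumptions, since the text only names the criterion.

$$E(\phi) = \frac{1}{T} \sum_{t=1}^{T} \big\lVert \phi(\mathbf{a}(t)) - \mathbf{b}(t) \big\rVert^{2}$$

With this normalization the criterion is a mean square error over the training sequence.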

B. Basic Description of the Echo State Network Model 

The ESN model is a Neural Network composed of a hidden recurrent structure (called the reservoir) and a readout structure that is a linear regression. The role of the reservoir is to encode the temporal information of the input data. Besides, the reservoir provides a complex nonlinear transformation of the input patterns, which enhances the linear separability of the input data. The readout structure is used for supervised training. In the canonical ESN the readout structure is a linear regression model [1].

We follow the previous notation concerning the training set, and we use the notation for the components of the ESN model presented in [1]. The training set is collected in the pairs $(\mathbf{a}(t), \mathbf{b}(t))$, $t = 1, \ldots, T$. A vector $\mathbf{x}(t)$ represents the reservoir state at each time $t$. We denote by $N_a$, $N_x$ and $N_b$ the dimensions of the vectors $\mathbf{a}$, $\mathbf{x}$ and $\mathbf{b}$, respectively. In the canonical ESN, the transfer function of the reservoir neurons is the $\tanh(\cdot)$ function. The reservoir state is computed as follows:

$$x_m(t) = \tanh\Big( w^{\text{in}}_{m0} + \sum_{i=1}^{N_a} w^{\text{in}}_{mi}\, a_i(t) + \sum_{i=1}^{N_x} w^{\text{r}}_{mi}\, x_i(t-1) \Big), \qquad (1)$$

$\forall m \in [1, N_x]$, where the weight connections between input and reservoir nodes are given by an $N_x \times (N_a + 1)$ weight matrix $\mathbf{w}^{\text{in}}$, the connections among the reservoir neurons are represented by an $N_x \times N_x$ weight matrix $\mathbf{w}^{\text{r}}$, and an $N_b \times (N_x + N_a + 1)$ weight matrix $\mathbf{w}^{\text{out}}$ represents the connections between reservoir and output units.
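
The state update of expression (1) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `reservoir_step` and `run_reservoir`, the zero initial state, and the tiny network sizes are hypothetical choices, not taken from the paper.

```python
import math
import random


def reservoir_step(w_in, w_r, a, x_prev):
    """Apply expression (1) once and return the new reservoir state x(t).

    w_in:   N_x rows of length N_a + 1; column 0 holds the bias term w_in[m][0].
    w_r:    N_x rows of length N_x (hidden-hidden weights).
    a:      input vector a(t) of length N_a.
    x_prev: previous state x(t-1) of length N_x.
    """
    x_new = []
    for m in range(len(w_r)):
        s = w_in[m][0]
        s += sum(w_in[m][i + 1] * a[i] for i in range(len(a)))
        s += sum(w_r[m][i] * x_prev[i] for i in range(len(x_prev)))
        x_new.append(math.tanh(s))
    return x_new


def run_reservoir(w_in, w_r, inputs):
    """Drive the reservoir with a sequence of inputs, starting from x = 0 (assumed)."""
    x = [0.0] * len(w_r)
    states = []
    for a in inputs:
        x = reservoir_step(w_in, w_r, a, x)
        states.append(x)
    return states


# Hypothetical toy sizes; the weight ranges are only illustrative.
rng = random.Random(0)
N_a, N_x = 1, 5
w_in = [[rng.uniform(-0.8, 0.8) for _ in range(N_a + 1)] for _ in range(N_x)]
w_r = [[rng.uniform(-0.2, 0.2) for _ in range(N_x)] for _ in range(N_x)]
states = run_reservoir(w_in, w_r, [[0.1], [0.5], [0.3]])
```

Each entry of `states` is one reservoir state $\mathbf{x}(t)$; these states are what the readout later regresses on.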

The number of reservoir units is much larger than the dimensionality of the input space ($N_a \ll N_x$) [1]. We denote by $\mathbf{y}(t)$ the model output at time $t$, which is generated by a linear regression as follows:

$$y_m(t) = w^{\text{out}}_{m0} + \sum_{i=1}^{N_a} w^{\text{out}}_{mi}\, a_i(t) + \sum_{i=1}^{N_x} w^{\text{out}}_{m(i+N_a)}\, x_i(t), \qquad (2)$$

$\forall m \in [1, N_b]$.
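
As an illustration of expression (2), the sketch below fits the readout weights of a single output unit by ridge regression and then evaluates the linear readout. It is a minimal sketch under assumptions: the helper names (`solve_linear`, `ridge_readout`, `readout`) are hypothetical, the bias weight is regularized together with the other weights, and the normal equations are solved with a plain Gaussian elimination rather than a numerical library.

```python
def solve_linear(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[k]] for k, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def ridge_readout(features, targets, gamma):
    """Fit one output unit by minimizing ||Z w - b||^2 + gamma * ||w||^2.

    Each row of `features` is z(t) = [1, a_1(t), ..., a_Na(t), x_1(t), ..., x_Nx(t)].
    """
    d = len(features[0])
    A = [[sum(z[i] * z[j] for z in features) + (gamma if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    rhs = [sum(z[i] * b for z, b in zip(features, targets)) for i in range(d)]
    return solve_linear(A, rhs)


def readout(w_out_row, a, x):
    """Expression (2) for one output unit m."""
    z = [1.0] + list(a) + list(x)
    return sum(w * v for w, v in zip(w_out_row, z))


# Toy check: recover y = 2 + 3*a - x from five (a, x) samples.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
features = [[1.0, a, x] for a, x in samples]
targets = [2.0 + 3.0 * a - x for a, x in samples]
w_out = ridge_readout(features, targets, gamma=1e-9)
```

In an actual ESN the `x` entries of each feature row would be the reservoir states collected after the washout period, and `gamma` would be the regularization parameter of the ridge regression.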

C. The Particle Swarm Optimization Technique 

The Particle Swarm Optimization (PSO) method is an algorithm for finding optimal points in complex search spaces. The technique is based on the social behavior of a set of particles (swarm) in a simplified environment. The procedure searches for optimal points in a multidimensional space by adjusting vectors that represent particle positions. The update rule of the trajectories is inspired by social interactions.

More formally, let $N$ be the number of particles in the system and $M$ the dimension of the search space. Each particle $i$ is characterized by a pair $(\mathbf{x}_i, \mathbf{v}_i)$, $\mathbf{x}_i, \mathbf{v}_i \in \mathbb{R}^M$. Metaphorically speaking, the vector $\mathbf{x}_i$ represents the position of $i$ and $\mathbf{v}_i$ represents the velocity of $i$. We denote by $\mathbf{p}_i(t)$ the best position of $i$ found up to time $t$. Let $\mathbf{p}^*(t)$ be the best position found by the whole swarm up to time $t$. The algorithm is iterative: at each epoch the objective function (the function to be optimized) is evaluated, and then the vectors $\mathbf{x}_i$ and $\mathbf{v}_i$ are updated for all $i$. At any time $t$, the system dynamics are given by the expressions:

$$\mathbf{v}_i(t+1) = \iota\, \mathbf{v}_i(t) + \boldsymbol{\delta}^1(t) \big( \mathbf{p}_i(t) - \mathbf{x}_i(t) \big) + \boldsymbol{\delta}^2(t) \big( \mathbf{p}^*(t) - \mathbf{x}_i(t) \big), \qquad (3)$$

and

$$\mathbf{x}_i(t+1) = \mathbf{x}_i(t) + \mathbf{v}_i(t+1), \qquad (4)$$

where the parameter $\iota \in (0,1)$ is called the inertia, and $\boldsymbol{\delta}^1$ and $\boldsymbol{\delta}^2$ are two diagonal matrices. The inertia controls the tradeoff between exploitation and exploration of the search space. The diagonal elements $\delta^1_j$ and $\delta^2_j$ are uniformly distributed in $[0, \varphi_1]$ and $[0, \varphi_2]$, respectively. These matrices weight the attraction of each particle towards its own best position and towards the best position of the swarm. For this reason, the parameters $\varphi_1$ and $\varphi_2$ are called the acceleration coefficients. A pseudo-code of the PSO technique is presented in Algorithm 1.


Algorithm 1 Specification of the Particle Swarm Optimization used for finding the weight matrix of the reservoir.

    t = t_0;
    Initialize population (x_i, v_i)(t), for all i;
    Evaluate F(x_i), for all i;
    Set p*(t) and p_i(t) for all i;
    while (termination criterion is not satisfied) do
        for (each particle i) do
            Compute v_i(t+1) using (3);
            Compute x_i(t+1) using (4);
            Evaluate F(x_i);
            Update local best p_i(t+1);
        end for
        Update global best p*(t+1);
        t = t + 1;
    end while
    Return p*(t);
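
The steps of Algorithm 1 can be sketched in plain Python as follows. This is a minimal sketch, not the implementation used in the experiments: the function name `pso`, the default values of the inertia and acceleration coefficients, the initialization range, the zero initial velocities, the fixed number of epochs as termination criterion, and the convention that the objective is minimized are all assumptions made for illustration.

```python
import random


def pso(fitness, dim, n_particles=10, inertia=0.7, phi1=1.5, phi2=1.5,
        epochs=100, low=-1.0, high=1.0, seed=0):
    """Minimize `fitness` over R^dim following expressions (3) and (4)."""
    rng = random.Random(seed)
    x = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p = [xi[:] for xi in x]                      # local bests p_i
    p_val = [fitness(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: p_val[i])
    best, best_val = p[g][:], p_val[g]           # global best p*
    for _ in range(epochs):
        for i in range(n_particles):
            for j in range(dim):
                # Diagonal elements of delta^1 and delta^2, drawn per dimension.
                d1 = rng.uniform(0.0, phi1)
                d2 = rng.uniform(0.0, phi2)
                v[i][j] = (inertia * v[i][j]
                           + d1 * (p[i][j] - x[i][j])
                           + d2 * (best[j] - x[i][j]))
                x[i][j] += v[i][j]
            val = fitness(x[i])
            if val < p_val[i]:                   # update local best
                p[i], p_val[i] = x[i][:], val
        # The global best is updated once per epoch, after all particles moved.
        g = min(range(n_particles), key=lambda i: p_val[i])
        if p_val[g] < best_val:
            best, best_val = p[g][:], p_val[g]
    return best, best_val


def sphere(z):
    """Simple test objective with its minimum at the origin."""
    return sum(c * c for c in z)


best, best_val = pso(sphere, dim=2)
```

In the hybrid method, `fitness` would map a particle (a vector of candidate reservoir weights) to the error of the resulting ESN, and `dim` would play the role of $M$.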


III. The PSO for Setting the ESN Model

The performance of the ESN model depends mainly on the following global parameters: the input scaling factor, the reservoir size, the spectral radius of the reservoir matrix and the topology of the reservoir network. The input scaling factor controls the impact of the inputs on the reservoir state. In the RC literature, the reservoir is usually a large sparse pool of interconnected neurons. Projecting the input into a larger reservoir space improves the model accuracy, although there is a tradeoff in the reservoir size: a reservoir that is too large can cause over-fitting. The spectral radius affects the stability and chaoticity of the reservoir dynamics and, as a consequence, it influences the memory capability of the model. In practice, the stability of the ESN reservoir is ensured by keeping the spectral radius below 1; this stability condition was established in relation to the Echo State Property (ESP). According to previous experience, the impact of the reservoir density on the model accuracy is not clear. However, sparse matrices are processed faster than dense matrices, so a sparse reservoir can reduce the running time. Recently, an evolutionary algorithm was used to find the reservoir size, the spectral radius and the density of the reservoir matrix [9]. In addition, evolutionary and genetic algorithms were applied for optimizing the reservoir global parameters and for designing the connectivity of the reservoir [10]-[12].


The PSO technique was already used for defining the spectral radius and other main parameters of the reservoir in [13]. Nevertheless, it is known that different reservoirs with the same spectral radius can show a substantial variance in model accuracy. In recurrent topologies, computing the moduli of the eigenvalues can be non-robust and computationally expensive. The convergence rate of the spectrum computation is determined by how close certain eigenvalues are to zero. Besides, rescaling the reservoir matrix by the spectral radius has a high computational cost [12].

In this article, we propose a hybrid method which uses the PSO for adjusting a subset of the reservoir weights without computing the spectrum of the reservoir matrix. We do not use the PSO for finding the spectral radius or the other global parameters.

The weights can be classified into the following categories: input weights, random reservoir weights, reservoir weights adjusted by the PSO, and readout weights. We denote by $\Omega^{\text{in}}$ the set of input weights, collected in the matrix $\mathbf{w}^{\text{in}}$; by $\Omega^{\text{r}}$ the set of reservoir weights, collected in the matrix $\mathbf{w}^{\text{r}}$; and by $\Omega^{\text{out}}$ the set of readout weights, collected in the matrix $\mathbf{w}^{\text{out}}$. Let $\Omega^{h}$ be the subset of the reservoir weights ($\Omega^{h} \subset \Omega^{\text{r}}$) that are adjusted using the PSO method. The weights in $\Omega^{h}$ are hidden weights randomly selected from $\Omega^{\text{r}}$. The relationship between the cardinalities of $\Omega^{h}$ and $\Omega^{\text{r}}$ is given by $|\Omega^{h}| = \alpha |\Omega^{\text{r}}|$, where $\alpha \in (0,1)$ and $|\cdot|$ denotes the cardinality of a set. The parameter $\alpha$ is empirically estimated. Figure 1 presents an example of the different kinds of parameters, wherein $\Omega^{h}$ and $\Omega^{\text{out}}$ are represented by blue dashed and dotted lines, respectively. The other weights are represented by black solid lines. Only the blue weights are adjusted in this approach. The procedure to train this hybrid model is summarized in Algorithm 2.
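
The random selection of the subset of reservoir weights tuned by the PSO, and the way a particle is written back into the reservoir matrix, can be sketched as below. The function names `select_subset` and `apply_particle` are hypothetical, and the sketch assumes that every position of the $N_x \times N_x$ reservoir matrix is a candidate for selection, which the text does not state explicitly.

```python
import random


def select_subset(n_x, alpha, rng):
    """Randomly pick alpha * N_x^2 positions (row, col) of the reservoir matrix."""
    positions = [(m, i) for m in range(n_x) for i in range(n_x)]
    size = max(1, round(alpha * len(positions)))
    return rng.sample(positions, size)


def apply_particle(w_r, subset, particle):
    """Return a copy of w_r whose selected entries are taken from a PSO particle."""
    w_new = [row[:] for row in w_r]
    for (m, i), value in zip(subset, particle):
        w_new[m][i] = value
    return w_new


# Toy example: a 4 x 4 reservoir where a quarter of the weights is tuned.
rng = random.Random(1)
n_x = 4
w_r = [[rng.uniform(-0.2, 0.2) for _ in range(n_x)] for _ in range(n_x)]
subset = select_subset(n_x, alpha=0.25, rng=rng)
particle = [0.1, -0.1, 0.05, 0.0]
w_tuned = apply_particle(w_r, subset, particle)
```

Under this reading, the dimension of each particle equals the size of the selected subset, and all weights outside the subset keep their random initial values.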



Figure 1: An example of the topology of the PSO-ESN model. Solid black lines represent weights that are kept fixed during the learning process, blue dashed lines represent the weights adjusted by the PSO, and blue dotted lines represent the readout weights adjusted using a memoryless supervised learning method (for instance, a linear regression model).











Algorithm 2 Pseudo-algorithm of the PSO-based phase for setting the ESN model.

    Initialize the PSO parameters: M, N, iota, delta^1, delta^2;
    Initialize the weights Omega^in and Omega^r using a random distribution;
    Select Omega^h (a subset of Omega^r) using a random distribution;
    repeat
        Apply the PSO for optimizing Omega^h (Algorithm 1);
        Compute w^out using linear ridge regression;
        Evaluate the accuracy of the model;
    until criterion is satisfied
    Return the network weights;


IV. Empirical Results 

In this section, we report the performance of the canonical ESN model and of the hybrid method introduced in the preceding section on four benchmark experiments. We use the acronym PSO-ESN to denote the procedure proposed in this work. We call an epoch one iteration of the training algorithm through all the examples in the training set. In order to obtain statistically meaningful results, we run each model on each benchmark using 30 different random initializations. In the case of the PSO algorithm, for each benchmark we use a grid of values for $M$ and $N$.

We compare the following procedures: 

• ESN: we initialize the network weights using a uniform random distribution $U[w_{\min}, w_{\max}]$. The topology consists of a network with three fully connected layers (input, reservoir and output layer). We control the density and the spectral radius of the reservoir. We rescale the weights of $\mathbf{w}^{\text{r}}$ using the spectral radius in order to ensure the ESP. We project the input space using the reservoir. Next, we compute the readout weights using the training set and standard ridge regression. We repeat the experiment, evaluating the performance for several values of the spectral radius of $\mathbf{w}^{\text{r}}$. In our experiments, the reservoir size and density are fixed.

• PSO-ESN: we initialize the network weights $\Omega^{\text{in}}$ and $\Omega^{\text{r}}$ using a uniform random distribution $U[w_{\min}, w_{\max}]$. Next, we randomly select a subset $\Omega^{h}$ such that $\Omega^{h} \subset \Omega^{\text{r}}$. Then, we apply the PSO for setting the weights in $\Omega^{h}$. In this step we use the Mean Square Error (MSE) as the fitness function of the PSO algorithm. Finally, we use the training set for computing the readout weights.

The statistical comparison between the accuracy reached by the two methods was carried out using confidence intervals. We use asymptotic confidence intervals for the mean accuracy reached over the different experiments.

The remainder of this section describes the data sets, the experimental setting and the results obtained.


A. Description of the Benchmarks 

We use the following benchmark problems. The first data set consists of experimental data measured with a LeCroy oscilloscope; the patterns correspond to the intensity pulsations of a laser. This benchmark is often called the Santa Fe Laser data. The data is a cross-cut through periodic to chaotic intensity pulsations of the laser, which more or less follow the theoretical Lorenz model of a two-level system [18]. The task consists of predicting the next laser pulsation $b(t+1)$, given the preceding values up to time $t$. The original data consists of only 1000 measurements; we use 499 for training and 500 for testing. We use a washout of 30 samples. The initial input weights are in $[-0.8, 0.8]$ and the initial reservoir weights are in $[-0.2, 0.2]$. The regularization parameter ($\gamma$) used for computing the readouts was set to 0.001. The reservoir has 50 units; the spectral radius and the sparsity of the reservoir matrix were 0.9 and 0.3, respectively.

The Nonlinear Autoregressive Moving Average (NARMA) is a widely studied benchmark problem [12], [19], [20]. The interest of this data set lies in the high degree of chaos in its dynamics. Additionally, the data can present long-range dependencies; as a consequence, learning the patterns of the training set is a difficult task. The sequence of patterns is generated by the expression:

$$b(t+1) = c_1 b(t) + c_2 b(t) \sum_{i=0}^{k-1} b(t-i) + c_3\, s(t-(k-1))\, s(t) + c_4, \qquad (5)$$

where $s(t) \sim U[0, 0.5]$ and the constants are $c_1 = 0.3$, $c_2 = 0.05$, $c_3 = 1.5$ and $c_4 = 0.1$. The data set was rescaled to $[0,1]$. In order to evaluate the memory capability of the model, we consider two simulated NARMA series with $k = 10$ and $k = 30$. For the 10th order NARMA, we generate a training set with 1990 samples and a test set with 390 samples. The 30th order NARMA training set has 2772 samples and the test set has 1428 samples. 70% of the weight connections among reservoir units are zero. The reservoir size is 150 units for the 10th order NARMA and 200 units for the 30th order NARMA.
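
A series following expression (5) can be generated with a short routine such as the one below. This is a sketch under assumptions: the names `narma` and `rescale` are hypothetical, the first $k$ values of the series are initialized to zero, and a min-max transformation is used for the rescaling to $[0,1]$, none of which is specified in the text.

```python
import random


def narma(k, n_samples, seed=0, c1=0.3, c2=0.05, c3=1.5, c4=0.1):
    """Generate the input s(t) and the NARMA output b(t) of order k, expression (5)."""
    rng = random.Random(seed)
    s = [rng.uniform(0.0, 0.5) for _ in range(n_samples)]
    b = [0.0] * n_samples                     # assumed zero initial conditions
    for t in range(k - 1, n_samples - 1):
        window = sum(b[t - i] for i in range(k))
        b[t + 1] = (c1 * b[t] + c2 * b[t] * window
                    + c3 * s[t - (k - 1)] * s[t] + c4)
    return s, b


def rescale(values):
    """Min-max rescaling to [0, 1] (assumed form of the rescaling)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


s, b = narma(k=10, n_samples=300)
scaled = rescale(b)
```

The recurrence is not guaranteed to stay bounded for every order and every random seed, so it is worth checking the generated values before using them for training.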

The last benchmark problem refers to Internet traffic prediction. The data comes from an Internet Service Provider (ISP) working in 11 European cities. The original data was collected in bits using a time interval of five minutes. The training and test sets have 9848 and 4924 samples, respectively. The goal is to predict the Internet traffic at time $t+1$ using the information from time $t-6$ up to time $t$. More details about this data set and a forecasting analysis can be found in [21]-[23].
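
The prediction setting described above (values from time $t-6$ up to $t$ as input, value at $t+1$ as target) corresponds to a sliding window over the series. A minimal sketch follows; the function name `make_windows` and the toy series are illustrative assumptions, not part of the original data pipeline.

```python
def make_windows(series, lag=6):
    """Build (input, target) pairs: inputs are series[t-lag..t], targets are series[t+1]."""
    inputs, targets = [], []
    for t in range(lag, len(series) - 1):
        inputs.append(series[t - lag:t + 1])
        targets.append(series[t + 1])
    return inputs, targets


# Toy series standing in for the traffic measurements.
toy_series = list(range(10))
toy_inputs, toy_targets = make_windows(toy_series)
```

Each input window has seven values, so under this reading the input dimension $N_a$ would be 7 for this benchmark.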

B. First Results 

Table I summarizes the accuracy reached by the PSO-ESN on the experiments. The first column identifies the benchmark task and the second column gives the dimension of each particle in the PSO technique. The last two columns indicate the performance of the PSO-ESN: the third column is the MSE averaged over 30 different initializations and the fourth column is the standard deviation of the MSE computed over the different initializations.

Table II presents the performance of the ESN model. The second column shows the mean accuracy reached over the 30 trials and the third column gives the standard deviation of these error measures. Table III shows the accuracy reached by both models on the training and test Internet traffic data sets. The real values in the tables are written in scientific notation.

We can generate a confidence interval (CI) of the MSE, $[e_{\min}, e_{\max}]$, using the standard deviations reported in Tables I and II. Let $[e^1_{\min}, e^1_{\max}]$ be the CI for the MSE obtained with the PSO-ESN method, and let $[e^2_{\min}, e^2_{\max}]$ be the CI computed for the MSE reached by the ESN model. Note that, if we generate 95% CIs using a normal approximation, then $[e^1_{\min}, e^1_{\max}]$ and $[e^2_{\min}, e^2_{\max}]$ are disjoint intervals. Specifically, we have $e^1_{\max} < e^2_{\min}$ for the four benchmarks studied in this work.
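
The interval comparison can be reproduced with a few lines of Python. The sketch below assumes the usual normal-approximation interval, mean plus or minus 1.96 times the standard deviation divided by the square root of the number of runs (30); the exact CI formula is not given in the text, and the function name `ci95` is hypothetical. The numbers are the 10th order NARMA entries of Tables I (with M = 15) and II.

```python
import math


def ci95(mean, std, n=30, z=1.96):
    """Asymptotic 95% confidence interval of the mean (normal approximation)."""
    half_width = z * std / math.sqrt(n)
    return mean - half_width, mean + half_width


pso_esn = ci95(2.0462e-4, 1.9416e-9)   # Table I, 10th NARMA, M = 15
esn = ci95(2.0538e-4, 1.6871e-9)       # Table II, 10th NARMA
separated = pso_esn[1] < esn[0]        # upper PSO-ESN bound below lower ESN bound
```

For these entries the half-widths are several orders of magnitude smaller than the gap between the two means, so the two intervals do not overlap.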

Figure 2 shows the accuracy reached by both models on the Laser data set. Red lines correspond to the ESN model and blue lines correspond to the PSO-ESN. The figure shows the error obtained on the training and test data sets for different initializations.

Figure 3 illustrates the influence of the parameter $M$ on the accuracy of the PSO-ESN model. This parameter represents the dimension of each particle of the swarm, that is, the number of reservoir weights that are adjusted using the PSO. According to the figure, larger values of $M$ reach better accuracy. For instance, in Figure 3, for a number of epochs equal to 60, the line at the top corresponds to $M = 5$ and the line at the bottom corresponds to $M = 30$ (the order of the lines from top to bottom is $M = 5, 10, 15, 20$ and $30$). On the other hand, a larger search space (larger value of $M$) can increase the running time and can cause over-fitting. According to our empirical results, $M = N_x/5$ is enough to obtain better accuracy than the ESN model whose reservoir weights are rescaled with the spectral radius.

Table I: Performance of the PSO-ESN hybrid method on the test data sets of the Laser and NARMA benchmark problems. The second column corresponds to the dimension of each particle in the PSO algorithm. Columns 3 and 4 are the average and the standard deviation of the accuracy obtained over the 30 experiments.

Data set       M     Mean              Stdv
Laser data     5     8.6657 x 10^-4    9.3541 x 10^-9
10th NARMA     15    2.0462 x 10^-4    1.9416 x 10^-9
10th NARMA     30    1.9623 x 10^-4    1.4595 x 10^-9
30th NARMA     40    1.3247 x 10^-2    1.5476 x 10^-7

Table II: Performance of the ESN model on the test data sets of the Laser and NARMA benchmark problems. Columns 2 and 3 are the average and the standard deviation of the accuracy obtained over the experiments.

Data set       Mean              Stdv
Laser data     1.5220 x 10^-3    3.0711 x 10^-8
10th NARMA     2.0538 x 10^-4    1.6871 x 10^-9
30th NARMA     1.4025 x 10^-2    1.2221 x 10^-7

Table III: Performance of both PSO-ESN and ESN on the training and test data sets of the Internet traffic prediction problem. Columns 3 and 4 are the average and the standard deviation of the accuracy obtained over the 30 experiments.

Method      Data set    Mean              Stdv
PSO-ESN     Train       7.1613 x 10^-8    1.6245 x 10^-15
ESN         Train       1.2802 x 10^-7    1.6833 x 10^-14
PSO-ESN     Test        1.5293 x 10^-6    1.3340 x 10^-12
ESN         Test        2.5661 x 10^-6    3.6516 x 10^-12


Figure 2: Accuracy of several initializations of the PSO-ESN and ESN on the training and test sets of the Laser data set. The reservoir has 50 units, M = 5 and the PSO has 10 particles. Although the epochs are independent of each other, for better visualisation we draw a continuous curve for the test experiments and dashed curves for the training experiments.

V. Conclusions and Future Work 

In this article we present a method that uses the Particle Swarm Optimization (PSO) to initialize Echo State Networks (ESN) for solving temporal supervised learning tasks. The ESN model is an efficient technique to design and train a Recurrent Neural Network. On the other hand, the PSO algorithm has been successfully used for optimizing continuous functions.































Figure 3: Example of the evolution of the MSE of the 
training set according to M for the 10th order NARMA 
data set. The MSE is for one specific random initialization 
of the weights. The reservoir has 150 units and the PSO has 
35 particles. 


Over the last years, several approaches have been presented for designing the reservoir. In this contribution, we use the PSO for adjusting a subset of the reservoir weights. Tuning all the reservoir weights using metaheuristics can be a very expensive task. For this reason, a subset of the reservoir weights is randomly selected and adjusted using the PSO, so the setting of these reservoir weights is done automatically. Besides, the procedure does not require computing the spectrum of the reservoir matrix, which is a computationally expensive operation.

As future work, we can extend the same procedure to other Reservoir Computing methods. We are also interested in comparing the performance reached by the PSO algorithm with that of other bio-inspired techniques.